1 Problems studied

1. Tests of equality of variances.

2. Variable selection for linear models with high-dimensional covariates.

3. Split selection methods for classification trees.

4. Comparison of decision trees and other classification methods.

5. Unbiased piecewise-linear regression trees.

2 Summary of important results

The following results were obtained for each of the problems listed above.

Bracketed reference numbers refer to the list of publications in Section 3.

1. Seven tests of equality of variances were compared in terms of
robustness and power in a simulation experiment with small to moderate
sample sizes. The data were assumed to come from a location-scale
family with unknown means, variances, and density functions. The
tests considered were the Levene test, the Bartlett test with and
without kurtosis adjustment, the Box-Andersen test, and three jackknife
tests. The bootstrap versions of these tests were also compared. It
is found that the Levene test and one jackknife test, as well as the
bootstrap versions of the Levene test, the Bartlett test with kurtosis
adjustment, and two jackknife tests, are robust. Among these, the
bootstrap version of the Levene test tends to have the highest power.
The results are published in [7].
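
As a rough illustration of what a bootstrap version of the Levene test
can look like, the following sketch computes Levene's statistic from
absolute deviations about each group mean and calibrates it by
resampling pooled, mean-centered residuals. The function names, the
choice of the mean as the centering value, and the resampling scheme
are assumptions made for this example; the exact procedures compared
in [7] may differ.

```python
import random
from statistics import mean


def levene_stat(groups):
    """Levene's W statistic based on absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    z = []
    for g in groups:
        center = mean(g)
        z.append([abs(y - center) for y in g])
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    if within == 0:
        # Degenerate sample: avoid dividing by zero.
        return float("inf") if between > 0 else 0.0
    return (n_total - k) / (k - 1) * between / within


def bootstrap_levene_pvalue(groups, n_boot=999, seed=0):
    """Bootstrap p-value: resample pooled residuals under equal variances."""
    rng = random.Random(seed)
    observed = levene_stat(groups)
    # Assumed scheme: remove each group's mean, then pool the residuals.
    pooled = []
    for g in groups:
        center = mean(g)
        pooled.extend(y - center for y in g)
    exceed = 0
    for _ in range(n_boot):
        resampled = [rng.choices(pooled, k=len(g)) for g in groups]
        if levene_stat(resampled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)


# Hypothetical data: three small samples with visibly different spreads.
samples = [
    [4.9, 5.1, 5.0, 5.2, 4.8, 5.0],
    [4.0, 6.1, 5.5, 3.8, 6.4, 4.6],
    [1.0, 9.2, 5.3, 0.4, 8.8, 6.1],
]
print(levene_stat(samples), bootstrap_levene_pvalue(samples))
```

Resampling from the pooled residuals imposes the null hypothesis of a
common scale without assuming a particular density, which is one way
to address the robustness concern raised above.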

2. The problem is that of variable selection in linear regression models
when the number of covariates is allowed to increase with the sample
size. The approach in [5] for the fixed design situation is extended to
the case of random covariates. This yields a unified consistent selection
criterion for both random and fixed covariates. By using t-statistics to
order the covariates, the method requires much less computation than
an all-subsets search. The method can be applied to autoregressive
model selection with increasing order. Simulation experiments were
carried out to validate the theory. The results are published in [8].
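
The computational saving from ordering the covariates can be sketched
as follows: with p covariates, only the p + 1 nested models along the
t-statistic ordering are fitted, instead of all 2^p subsets. The
BIC-type penalty used here (log n times the full-model variance
estimate per covariate) is an assumption for illustration and is not
necessarily the criterion derived in [8]; all function names are
hypothetical.

```python
import math
import random


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting.
    No check is made for singular input; this is only a sketch."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(design, y):
    """Least squares fit: returns coefficients, RSS, and (X'X)^{-1}."""
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(design, y))
    return beta, rss, inv


def select_by_t_ordering(x, y, penalty=None):
    """Order covariates by |t| in the full model, then pick the best nested model."""
    n, p = len(x), len(x[0])
    penalty = math.log(n) if penalty is None else penalty  # assumed BIC-type penalty
    full = [[1.0] + list(r) for r in x]  # intercept plus all covariates
    beta, rss_full, inv = ols(full, y)
    sigma2 = rss_full / (n - p - 1)
    t_abs = [abs(beta[j + 1]) / math.sqrt(sigma2 * inv[j + 1][j + 1])
             for j in range(p)]
    order = sorted(range(p), key=lambda j: -t_abs[j])
    best_k, best_crit = 0, None
    for k in range(p + 1):  # p + 1 nested fits instead of 2**p subsets
        design = [[1.0] + [r[j] for j in order[:k]] for r in x]
        _, rss, _ = ols(design, y)
        crit = rss + penalty * sigma2 * k
        if best_crit is None or crit < best_crit:
            best_k, best_crit = k, crit
    return order, sorted(order[:best_k])


# Hypothetical simulated data: only covariates 0 and 3 affect the response.
rng = random.Random(0)
x = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y = [3 * r[0] - 2 * r[3] + rng.gauss(0, 1) for r in x]
print(select_by_t_ordering(x, y))
```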

3. Classification trees based on exhaustive search algorithms (such as AID
and CART) tend to be biased towards selecting variables that allow
more splits. As a result, such trees need to be interpreted with
caution. An algorithm called QUEST that has negligible selection bias was
developed. Its split selection strategy shares similarities with the FACT
method, but it yields binary splits and the final tree can be selected
by a direct stopping rule or by pruning. Real and simulated data were
used to compare QUEST with the exhaustive search approach. QUEST
is shown to be substantially faster, and the size and classification
accuracy of its trees are typically comparable to those of exhaustive search.
The results are reported in [9]. Compiled executable versions of the
computer program are available for downloading from the PI's homepage
(http://www.stat.wisc.edu/~loh/). The QUEST algorithm
has been adopted by the commercial software publishers of SPSS and
STATISTICA for inclusion in their packages.

4. Twenty-two decision tree, nine statistical, and two neural network
classifiers were compared on thirty-two datasets in terms of classification
error rate, computational time, and (in the case of trees) number of
terminal nodes. It is found that the differences in average error rates
among a majority of the classifiers are not statistically significant, but
the computational times of the classifiers differ over a wide range. The
statistical classifier POLYCLASS, based on a logistic regression spline
algorithm, has the lowest average error rate. However, it is also one of
the most computationally intensive. The classifier based on standard
polytomous logistic regression and the QUEST classification tree with
linear splits have the second lowest average error rates but are about
50 times faster than POLYCLASS. Among decision tree classifiers with
univariate splits, the classifiers based on the C4.5, IND-CART, and QUEST
algorithms have the best combination of error rate and speed, although
the C4.5 trees tend to have about twice as many nodes as those from the
other two algorithms. The C4.5 classifier based on rules also has good
accuracy, but it does not scale as well as the other methods. These results
are reported in [11].

5. A piecewise-constant regression tree model can be valuable for the
insights that its tree structure provides. However, the standard
exhaustive search approach to tree construction has three weaknesses that
limit its usefulness. First, it possesses a variable selection bias that
can lead to erroneous conclusions. Second, the piecewise-constant trees
tend to have many levels of splits, which hinder interpretation. Third,
its split selection criterion focuses only on one predictor variable at
a time. As a result, it may fail to detect interactions between two
predictors, or require more than one split to uncover them.

An alternative approach to tree construction, called GUIDE, is
developed that (1) employs significance tests and the bootstrap to correct for
biases in variable selection, (2) permits the fitting of piecewise-linear
models to reduce tree complexity, and (3) chooses splits according to
measures of curvature within individual predictors as well as
interactions between pairs of predictors. The method accepts ordered and
unordered predictor variables, with unordered variables being allowed
to split the nodes but not participate in the linear model equations.
Simulation experiments show that the selection bias of the exhaustive
search approach can be quite severe. They also show that GUIDE is
effective in correcting the bias. The algorithm and results are reported
in [12].
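
The selection bias of exhaustive search can be seen in a small
simulation. In the sketch below, the response is generated
independently of both predictors, so an unbiased method would choose
each predictor about half the time. Exhaustive search over all
piecewise-constant splits instead tends to favor the continuous
predictor, which offers many more candidate split points than the
binary one. The sample size, number of replications, and function
names are assumptions for this illustration; it does not reproduce
the experiments in [12].

```python
import random


def sse(values):
    """Sum of squared deviations of the values about their mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split_sse(x, y):
    """Smallest total SSE over all splits of the form x <= c."""
    best = float("inf")
    for c in sorted(set(x))[:-1]:  # every distinct value except the largest
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        best = min(best, sse(left) + sse(right))
    return best


def fraction_choosing_continuous(n=40, n_sim=500, seed=1):
    """Share of null simulations in which the continuous predictor is selected."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sim):
        y = [rng.gauss(0, 1) for _ in range(n)]          # response: pure noise
        x_binary = [rng.randint(0, 1) for _ in range(n)]  # one possible split
        x_cont = [rng.random() for _ in range(n)]         # up to n - 1 splits
        if best_split_sse(x_cont, y) < best_split_sse(x_binary, y):
            wins += 1
    return wins / n_sim


# A value well above 0.5 indicates bias towards the continuous predictor.
print(fraction_choosing_continuous())
```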